Search CORE

17 research outputs found

Contention-Aware Scheduling for SMT Multicore Processors

Author: Feliu Pérez Josué
Publication venue: 'Universitat Politecnica de Valencia'
Publication date: 27/03/2017
Field of study

The recent multicore era and the incoming manycore/manythread era generate a lot of challenges for computer scientists going from productive parallel programming, over network congestion avoidance and intelligent power management, to circuit design issues. The ultimate goal is to squeeze out as much performance as possible while limiting power and energy consumption and guaranteeing a reliable execution. The increasing number of hardware contexts of current and future systems makes the scheduler an important component to achieve this goal, as there is often a combinatorial amount of different ways to schedule the distinct threads or applications, each with a different performance due to the inter-application interference. Picking an optimal schedule can result in substantial performance gains. This thesis deals with inter-application interference, covering the problems this fact causes on performance and fairness on actual machines. The study starts with single-threaded multicore processors (Intel Xeon X3320), follows with simultaneous multithreading (SMT) multicores supporting up to two threads per core (Intel Xeon E5645), and goes to the most highly threaded per-core processor that has ever been built (IBM POWER8). The dissertation analyzes the main contention points of each experimental platform and proposes scheduling algorithms that tackle the interference arising at each contention point to improve the system throughput and fairness. First we analyze contention through the memory hierarchy of current multicore processors. The performed studies reveal high performance degradation due to contention on main memory and any shared cache the processors implement. To mitigate such contention, we propose different bandwidth-aware scheduling algorithms with the key idea of balancing the memory accesses through the workload execution time and the cache requests among the different caches at each cache level. The high interference that different applications suffer when running simultaneously on the same SMT core, however, does not only affect performance, but can also compromise system fairness. In this dissertation, we also analyze fairness in current SMT multicores. To improve system fairness, we design progress-aware scheduling algorithms that estimate, at runtime, how the processes progress, which allows to improve system fairness by prioritizing the processes with lower accumulated progress. Finally, this dissertation tackles inter-application contention in the IBM POWER8 system with a symbiotic scheduler that addresses overall SMT interference. The symbiotic scheduler uses an SMT interference model, based on CPI stacks, that estimates the slowdown of any combination of applications if they are scheduled on the same SMT core. The number of possible schedules, however, grows too fast with the number of applications and makes unfeasible to explore all possible combinations. To overcome this issue, the symbiotic scheduler models the scheduling problem as a graph problem, which allows finding the optimal schedule in reasonable time. In summary, this thesis addresses contention in the shared resources of the memory hierarchy and SMT cores of multicore processors. We identify the main contention points of three systems with different architectures and propose scheduling algorithms to tackle contention at these points. The evaluation on the real systems shows the benefits of the proposed algorithms. The symbiotic scheduler improves system throughput by 6.7\% over Linux. Regarding fairness, the proposed progress-aware scheduler reduces Linux unfairness to a third. Besides, since the proposed algorithm are completely software-based, they could be incorporated as scheduling policies in Linux and used in small-scale servers to achieve the mentioned benefits.La actual era multinúcleo y la futura era manycore/manythread generan grandes retos en el área de la computación incluyendo, entre otros, la programación paralela productiva o la gestión eficiente de la energía. El último objetivo es alcanzar las mayores prestaciones limitando el consumo energético y garantizando una ejecución confiable. El incremento del número de contextos hardware de los sistemas hace que el planificador se convierta en un componente importante para lograr este objetivo debido a que existen múltiples formas diferentes de planificar las aplicaciones, cada una con distintas prestaciones debido a las interferencias que se producen entre las aplicaciones. Seleccionar la planificación óptima puede proporcionar importantes mejoras de prestaciones. Esta tesis se ocupa de las interferencias entre aplicaciones, cubriendo los problemas que causan en las prestaciones y equidad de los sistemas actuales. El estudio empieza con procesadores multinúcleo monohilo (Intel Xeon X3320), sigue con multinúcleos con soporte para la ejecución simultanea (SMT) de dos hilos (Intel Xeon E5645), y llega al procesador que actualmente soporta un mayor número de hilos por núcleo (IBM POWER8). La disertación analiza los principales puntos de contención en cada plataforma y propone algoritmos de planificación que mitigan las interferencias que se generan en cada uno de ellos para mejorar la productividad y equidad de los sistemas. En primer lugar, analizamos la contención a lo largo de la jerarquía de memoria. Los estudios realizados revelan la alta degradación de prestaciones provocada por la contención en memoria principal y en cualquier cache compartida. Para mitigar esta contención, proponemos diversos algoritmos de planificación cuya idea principal es distribuir los accesos a memoria a lo largo del tiempo de ejecución de la carga y las peticiones a las caches entre las diferentes caches compartidas en cada nivel. Las altas interferencias que sufren las aplicaciones que se ejecutan simultáneamente en un núcleo SMT, sin embargo, no solo afectan a las prestaciones, sino que también pueden comprometer la equidad del sistema. En esta tesis, también abordamos la equidad en los actuales multinúcleos SMT. Para mejorarla, diseñamos algoritmos de planificación que estiman el progreso de las aplicaciones en tiempo de ejecución, lo que permite priorizar los procesos con menor progreso acumulado para reducir la inequidad. Finalmente, la tesis se centra en la contención entre aplicaciones en el sistema IBM POWER8 con un planificador simbiótico que aborda la contención en todo el núcleo SMT. El planificador simbiótico utiliza un modelo de interferencia basado en pilas de CPI que predice las prestaciones para la ejecución de cualquier combinación de aplicaciones en un núcleo SMT. El número de posibles planificaciones, no obstante, crece muy rápido y hace inviable explorar todas las posibles combinaciones. Por ello, el problema de planificación se modela como un problema de teoría de grafos, lo que permite obtener la planificación óptima en un tiempo razonable. En resumen, esta tesis aborda la contención en los recursos compartidos en la jerarquía de memoria y el núcleo SMT de los procesadores multinúcleo. Identificamos los principales puntos de contención de tres sistemas con diferentes arquitecturas y proponemos algoritmos de planificación para mitigar esta contención. La evaluación en sistemas reales muestra las mejoras proporcionados por los algoritmos propuestos. Así, el planificador simbiótico mejora la productividad, en promedio, un 6.7% con respecto a Linux. En cuanto a la equidad, el planificador que considera el progreso consigue reducir la inequidad de Linux a una tercera parte. Además, dado que los algoritmos propuestos son completamente software, podrían incorporarse como políticas de planificación en Linux y usarse en servidores a pequeña escala para obtener los benefiL'actual era multinucli i la futura era manycore/manythread generen grans reptes en l'àrea de la computació incloent, entre d'altres, la programació paral·lela productiva o la gestió eficient de l'energia. L'últim objectiu és assolir les majors prestacions limitant el consum energètic i garantint una execució confiable. L'increment del número de contextos hardware dels sistemes fa que el planificador es convertisca en un component important per assolir aquest objectiu donat que existeixen múltiples formes distintes de planificar les aplicacions, cadascuna amb unes prestacions diferents degut a les interferències que es produeixen entre les aplicacions. Seleccionar la planificació òptima pot donar lloc a millores importants de les prestacions. Aquesta tesi s'ocupa de les interferències entre aplicacions, cobrint els problemes que provoquen en les prestacions i l'equitat dels sistemes actuals. L'estudi comença amb processadors multinucli monofil (Intel Xeon X3320), segueix amb multinuclis amb suport per a l'execució simultània (SMT) de dos fils (Intel Xeon E5645), i arriba al processador que actualment suporta un major nombre de fils per nucli (IBM POWER8). Aquesta dissertació analitza els principals punts de contenció en cada plataforma i proposa algoritmes de planificació que aborden les interferències que es generen en cadascun d'ells per a millorar la productivitat i l'equitat dels sistemes. En primer lloc, estudiem la contenció al llarg de la jerarquia de memòria en els processadors multinucli. Els estudis realitzats revelen l'alta degradació de prestacions provocada per la contenció en memòria principal i en qualsevol cache compartida. Per a mitigar la contenció, proposem diversos algoritmes de planificació amb la idea principal de distribuir els accessos a memòria al llarg del temps d'execució de la càrrega i les peticions a les caches entre les diferents caches compartides en cada nivell. Les altes interferències que sofreixen las aplicacions que s'executen simultàniament en un nucli SMT, no obstant, no sols afecten a las prestacions, sinó que també poden comprometre l'equitat del sistema. En aquesta tesi, també abordem l'equitat en els actuals multinuclis SMT. Per a millorar-la, dissenyem algoritmes de planificació que estimen el progrés de les aplicacions en temps d'execució, el que permet prioritzar els processos amb menor progrés acumulat para a reduir la inequitat. Finalment, la tesi es centra en la contenció entre aplicacions en el sistema IBM POWER8 amb un planificador simbiòtic que aborda la contenció en tot el nucli SMT. El planificador simbiòtic utilitza un model d'interferència basat en piles de CPI que prediu les prestacions per a l'execució de qualsevol combinació d'aplicacions en un nucli SMT. El nombre de possibles planificacions, no obstant, creix molt ràpid i fa inviable explorar totes les possibles combinacions. Per resoldre aquest contratemps, el problema de planificació es modela com un problema de teoria de grafs, la qual cosa permet obtenir la planificació òptima en un temps raonable. En resum, aquesta tesi aborda la contenció en els recursos compartits en la jerarquia de memòria i el nucli SMT dels processadors multinucli. Identifiquem els principals punts de contenció de tres sistemes amb diferents arquitectures i proposem algoritmes de planificació per a mitigar aquesta contenció. L'avaluació en sistemes reals mostra les millores proporcionades pels algoritmes proposats. Així, el planificador simbiòtic millora la productivitat una mitjana del 6.7% respecte a Linux. Pel que fa a l'equitat, el planificador que considera el progrés aconsegueix reduir la inequitat de Linux a una tercera part. A més, donat que els algoritmes proposats son completament software, podrien incorporar-se com a polítiques de planificació en Linux i emprar-se en servidors a petita escala per obtenir els avantatges mencionats.Feliu Pérez, J. (2017). Contention-Aware Scheduling for SMT Multicore Processors [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/79081TESISPremios Extraordinarios de tesis doctorale

Crossref

RiuNet

Planificación de tareas considerando la contención en la jerarquía de memoria de procesadores multinúcleo

Author: Feliu Pérez Josué
Publication venue: 'Universitat Politecnica de Valencia'
Publication date: 25/02/2013
Field of study

[ES] Con el objetivo de mejorar el rendimiento de los CMPs, parte de la investigación reciente se ha centrado en la planificación de procesos para limitar la contención provocada por el limitado ancho de banda. Hoy en dia, los CMPs comerciales implementan jerarquías de cache multinivel, donde las caches del último nivel son compartidas por al menos dos caches del nivel inferior. A su vez, estas caches pueden ser compartidas por varios núcleos multihilo. Con este diseño, los puntos de contención pueden aparecer a lo largo de toda la jerarquía de memoria del procesador. Además, el rendimiento de las cargas varía ampliamente en función de rendimiento de las aplicaciones en la LLC. Se espera que estos problemas se agraven en futuros procesadores, pues el número de núcleos, hilos de ejecución, y por tanto, grado de compartición de las caches se incrementa en cada generación. En este trabajo presentamos un conjunto de tests para determinar de manera experimental las características de la LLC, es decir, la geometría de la cache (número de conjuntos, número de vías y tamaño de la línea) y su arquitectura (privada o compartida) utilizando huge pages y contadores de prestaciones. Después, se caracteriza el impacto en el rendimiento de la contención en los diferentes puntos de contención que aparecen en el subsistema de memoria. Finalmente, proponemos una política de planificación genérica para CMPs que considera el ancho de banda disponible en cada nivel de la jerarquía de cache y la degradación del IPC que cada benchmark sufre debida a la contención en memoria principal. El algoritmo propuesto selecciona los procesos para ser ejecutados concurrentemente y los ubica en los núcleos para minimizar los efectos de la contención. La propuesta se ha implementado y evaluado en un procesador comercial con cuatro núcleos y sin capacidades multi-hilo, que implementa una jerarquía de cache relativamente pequeña con dos niveles de cache. A pesar de ello, la propuesta ofrece un incremento de prestaciones medio superior al 6.5% al compararla con la planificación de Linux, mientras que este incremento está por debajo del 3.3% para un planificador del estado del arte que considera la contención en el acceso a memoria. Es más, en algunas cargas la propuesta triplica la aceleración ofrecida por el planificador citado.[EN] In order to improve CMP performance, recent research has focused on scheduling to mitigate contention produced by the limited memory bandwidth. Nowadays, commercial CMPs implement multi-level cache hierarchies where last level caches are shared by at least two cache structures located at the immediately lower cache level. In turn, these caches can be shared by several multithreaded cores. In this microprocessor design, contention points may appear along the whole memory hierarchy. Besides, in these architectures the performance of multiprogrammed workloads widely varies depending on the performance of the LLC. Moreover, these problems are expected to aggravate in future technologies, since the number of cores and hardware threads, and consequently the size of the shared caches increases with each microprocessor generation. In this work we present a set of tests to experimentally determine the LLC features, that is, the cache geometry (number of sets, number of ways, and line size) and cache architecture type (private or shared) using huge pages and performance counters. Then we characterize the impact on performance of the different contention points that appear along the memory subsystem. Finally, we propose a generic scheduling strategy for CMPs that takes into account the available bandwidth at each level of the cache hierarchy and the IPC degradation each benchmark suffers due to main memory contention. The proposed strategy selects the processes to be co-scheduled and allocates them to cores in order to minimize contention effects. The proposal has been implemented and evaluated in a commercial single-threaded quad-core processor with a relatively small two-level cache hierarchy. Despite this small cache hierarchy presents less contention than more recent processor designs, the proposal reaches, on average, a performance improvement above 6.5% when compared with the Linux scheduler while this figure is below 3.2% for an state-of-theart memory-aware scheduler. Moreover, in some mixes the proposal triples the speedup achieved by the latter scheduler.Feliu Pérez, J. (2012). Planificación de tareas considerando la contención en la jerarquía de memoria de procesadores multinúcleo. http://hdl.handle.net/10251/27284Archivo delegad

RiuNet

Precise Runahead Execution

Author: Adileh Almutaz
Eeckhout Lieven
Feliu-Pérez Josué
Naithani Ajeya
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2019
Field of study

© 2019 IEEE. Personal use of this material is permitted. Permissíon from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertisíng or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.[EN] Runahead execution improves processor performance by accurately prefetching long-latency memory accesses. When a long-latency load causes the instruction window to fill up and halt the pipeline, the processor enters runahead mode and keeps speculatively executing code to trigger accurate prefetches. A recent improvement tracks the chain of instructions that leads to the long-latency load, stores it in a runahead buffer, and executes only this chain during runahead execution, with the purpose of generating more prefetch requests during runahead execution. Unfortunately, all these prior runahead proposals have shortcomings that limit performance and energy efficiency because they discard the full instruction window to enter runahead mode and then flush the pipeline to restart normal operation. This significantly constrains the performance benefits and increases the energy overhead of runahead execution. In addition, runahead buffer limits prefetch coverage by tracking only a single chain of instructions that lead to the same long-latency load. We propose precise runahead execution (PRE) to mitigate the shortcomings of prior work. PRE leverages the renaming unit to track all the dependency chains leading to long-latency loads. PRE uses a novel approach to manage free processor resources to execute the detected instruction chains in runahead mode without flushing the pipeline. Our results show that PRE achieves an additional 21.1 percent performance improvement over the recent runahead proposals while reducing energy consumption by 6.1 percent.This research is supported through FWO grants no. G.0434.16N and G.0144.17N, and European Research Council (ERC) Advanced Grant agreement no. 741097.Naithani, A.; Feliu-Pérez, J.; Adileh, A.; Eeckhout, L. (2019). Precise Runahead Execution. IEEE Computer Architecture Letters. 18(1):71-74. https://doi.org/10.1109/LCA.2019.2910518S717418

Crossref

Ghent University Academic Bibliography

RiuNet

Thread Isolation to Improve Symbiotic Scheduling on SMT Multicore Processors

Author: Eeckhout Lieven
Feliu-Pérez Josué
Petit Martí Salvador Vicente
Sahuquillo Borrás Julio
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2020
Field of study

© 2020 IEEE. Personal use of this material is permitted. Permissíon from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertisíng or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.[EN] Resource sharing is a critical issue in simultaneous multithreading (SMT) processors as threads running simultaneously on an SMT core compete for shared resources. Symbiotic job scheduling, which co-schedules applications with complementary resource demands, is an effective solution to maximize hardware utilization and improve overall system performance. However, symbiotic job scheduling typically distributes threads evenly among cores, i.e., all cores get assigned the same number of threads, which we find to lead to sub-optimal performance. In this paper, we show that asymmetric schedules (i.e., schedules that assign a different number of threads to each SMT core) can significantly improve performance compared to symmetric schedules. To leverage this finding, we propose thread isolation, a technique that turns symmetric schedules into asymmetric ones yielding higher overall system performance. Thread isolation identifies SMT-adverse applications and schedules them in isolation on a dedicated core to mitigate their sharp performance degradation under SMT. Our experimental results on an IBM POWER8 processor show that thread isolation improves system throughput by up to 5.5 percent compared to a state-of-the-art symmetric symbiotic job scheduler.Josue Feliu has been partially supported through a postdoctoral fellowship by the Generalitat Valenciana (APOSTD/2017/052). Additional support has been provided by the Ministerio de Ciencia, Innovacion y Universidades and the European ERDF under Grant RTI2018-098156-B-C51, as well as, by the Universitat Politenica de Valencia through the "Ayudas a Primeros Proyectos de Investigacion" (PAID-06-18) under grant SP20180140. Lieven Eeckhout's research program is supported through FWO grants no. G.0434.16N and G.0144.17N, and the European Research Council (ERC) Advanced Grant agreement no. 741097.Feliu-Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Eeckhout, L. (2020). Thread Isolation to Improve Symbiotic Scheduling on SMT Multicore Processors. IEEE Transactions on Parallel and Distributed Systems. 31(2):359-373. https://doi.org/10.1109/TPDS.2019.2934955S35937331

Crossref

Ghent University Academic Bibliography

RiuNet

Addressing fairness in SMT multicores with a progress-aware scheduler

Author: Duato Marín José Francisco
Feliu Pérez Josué
Petit Martí Salvador Vicente
Sahuquillo Borrás Julio
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 25/05/2015
Field of study

© 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Current SMT (simultaneous multithreading) processors co-schedule jobs on the same core, thus sharing core resources like L1 caches. In SMT multicores, threads also compete among themselves for uncore resources like the LLC (last level cache) and DRAM modules. Per process performance degradation over isolated execution mainly depends on process resource requirements and the resource contention induced by co-runners. Consequently, the running processes progress at different pace. If schedulers are not progress aware, the unpredictable execution time caused by unfairness can introduce undesirable behaviors on the system such as difficulties to keep priority-based scheduling. This work proposes a job scheduler for SMT multicores that provides fairness to the execution of multiprogrammed workloads. To this end, the scheduler estimates per-process standalone performance by periodically creating low-contention co-schedules. These estimates are used to compute the per process progress. Then, those processes with less progress are prioritized to enhance fairness. Experimental results on a Intel Xeon with six dual-threaded SMT cores show that the proposed scheduler reduces unfairness, on average, by 3× over Linux OS. Moreover, thanks to the tread to core allocation policy, the scheduler slightly improves throughput and turnaround time.This work was supported by the Spanish Ministerio de Econom´ıa y Competitividad (MINECO) and Plan E funds, under Grant TIN2012-38341-C04-01, and by the Intel Early Career Faculty Honor Program AwardFeliu Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2015). Addressing fairness in SMT multicores with a progress-aware scheduler. IEEE. https://doi.org/10.1109/IPDPS.2015.48

Crossref

RiuNet

Perf&Fair: A Progress-Aware Scheduler to Enhance Performance and Fairness in SMT Multicores

Author: Duato Marín José Francisco
Feliu-Pérez Josué
Petit Martí Salvador Vicente
Sahuquillo Borrás Julio
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017
Field of study

[EN] Nowadays, high performance multicore processors implement multithreading capabilities. The processes running concurrently on these processors are continuously competing for the shared resources, not only among cores, but also within the core. While resource sharing increases the resource utilization, the interference among processes accessing the shared resources can strongly affect the performance of individual processes and its predictability. In this scenario, process scheduling plays a key role to deal with performance and fairness. In this work we present a process scheduler for SMT multicores that simultaneously addresses both performance and fairness. This is a major design issue since scheduling for only one of the two targets tends to damage the other. To address performance, the scheduler tackles bandwidth contention at the L1 cache and main memory. To deal with fairness, the scheduler estimates the progress experienced by the processes, and gives priority to the processes with lower accumulated progress. Experimental results on an Intel Xeon E5645 featuring six dual-threaded SMT cores show that the proposed scheduler improves both performance and fairness over two state-of-the-art schedulers and the Linux OS scheduler. Compared to Linux, unfairness is reduced to a half while still improving performance by 5.6 percent.We thank the anonymous reviewers for their constructive and insightful feedback. This work was supported in part by the Spanish Ministerio de Economia y Competitividad (MINECO) and Plan E funds, under grants TIN2015-66972-C5-1-R and TIN2014-62246EXP, and by the Intel Early Career Faculty Honor Program Award.Feliu-Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2017). Perf&Fair: A Progress-Aware Scheduler to Enhance Performance and Fairness in SMT Multicores. IEEE Transactions on Computers. 66(5):905-911. https://doi.org/10.1109/TC.2016.2620977S90591166

RiuNet

Cache-Hierarchy contention-aware scheduling in CMPs

Author: Duato Marín José Francisco
Feliu Pérez Josué
Petit Martí Salvador Vicente
Sahuquillo Borrás Julio
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/03/2014
Field of study

© © 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other worksTo improve chip multiprocessor (CMP) performance, recent research has focused on scheduling strategies to mitigate main memory bandwidth contention. Nowadays, commercial CMPs implement multilevel cache hierarchies that are shared by several multithreaded cores. In this microprocessor design, contention points may appear along the whole memory hierarchy. Moreover, this problem is expected to aggravate in future technologies, since the number of cores and hardware threads, and consequently the size of the shared caches increase with each microprocessor generation. This paper characterizes the impact on performance of the different contention points that appear along the memory subsystem. The analysis shows that some benchmarks are more sensitive to contention in higher levels of the memory hierarchy (e.g., shared L2) than to main memory contention. In this paper, we propose two generic scheduling strategies for CMPs. The first strategy takes into account the available bandwidth at each level of the cache hierarchy. The strategy selects the processes to be coscheduled and allocates them to cores to minimize contention effects. The second strategy also considers the performance degradation each process suffers due to contention-aware scheduling. Both proposals have been implemented and evaluated in a commercial single-threaded quad-core processor with a relatively small two-level cache hierarchy. The proposals reach, on average, a performance improvement by 5.38 and 6.64 percent when compared with the Linux scheduler, while this improvement is by 3.61 percent for an state-of-the-art memory contention-aware scheduler under the evaluated mixes.This work was supported by the Spanish MINECO under Grant TIN2012-38341-C04-01, and by the Universitat Politecnica de Valencia under Grant PAID-05-12 SP20120748.Feliu Pérez, J.; Petit Martí, SV.; Sahuquillo Borrás, J.; Duato Marín, JF. (2014). Cache-Hierarchy contention-aware scheduling in CMPs. IEEE Transactions on Parallel and Distributed Systems. 25(3):581-590. https://doi.org/10.1109/TPDS.2013.61S58159025

Crossref

RiuNet

Improving IBM POWER8 Performance Through Symbiotic Job Scheduling

Author: Eeckhout Lieven
Eyerman Stijn
Feliu-Pérez Josué
Petit Martí Salvador Vicente
Sahuquillo Borrás Julio
Publication venue: Institute of Electrical and Electronics Engineers
Publication date: 01/01/2017
Field of study

[EN] Symbiotic job scheduling, i.e., scheduling applications that co-run well together on a core, can have a considerable impact on the performance of processors with simultaneous multithreading (SMT) cores. SMT cores share most of their microarchitectural components among the co-running applications, which causes performance interference between them. Therefore, scheduling applications with complementary resource requirements on the same core can greatly improve the throughput of the system. This paper enhances symbiotic job scheduling for the IBM POWER8 processor. We leverage the existing cycle accounting mechanism to build an interference model that predicts symbiosis between applications. The proposed models achieve higher accuracy than previous models by predicting job symbiosis from throttled CPI stacks, i.e., CPI stacks of the applications when running in the same SMT mode to consider the statically partitioned resources, but without interference from other applications. The symbiotic scheduler uses these interference models to decide, at run-time, which applications should run on the same core or on separate cores. We prototype the symbiotic scheduler as a user-level scheduler in the Linux operating system and evaluate it on an IBM POWER8 server running multiprogram workloads. The symbiotic job scheduler significantly improves performance compared to both an agnostic random scheduler and the default Linux scheduler. Across all evaluated workloads in SMT4 mode, throughput improves by 12.4 and 5.1 percent on average over the random and Linux schedulers, respectively.This work was supported in part by the Spanish Ministerio de Econom ıa y Competitividad (MINECO) and Plan E funds, under grants TIN2015-66972- C5-1-R and TIN2014-62246-EXP, as well as by the European Research Council under the European Community’s Seventh Framework Programme (FP7/2007-2013)/ERC grant agreement No. 259295.Feliu-Pérez, J.; Eyerman, S.; Sahuquillo Borrás, J.; Petit Martí, SV.; Eeckhout, L. (2017). Improving IBM POWER8 Performance Through Symbiotic Job Scheduling. IEEE Transactions on Parallel and Distributed Systems. 28(10):2838-2851. https://doi.org/10.1109/TPDS.2017.269170828382851281

Crossref

Ghent University Academic Bibliography

RiuNet

L1-Bandwidth Aware Thread Allocation in Multicore SMT Processors

Author: Duato Marín José Francisco
Feliu Pérez Josué
Petit Martí Salvador Vicente
Sahuquillo Borrás Julio
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2013
Field of study

© 2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Improving the utilization of shared resources is a key issue to increase performance in SMT processors. Recent work has focused on resource sharing policies to enhance the processor performance, but their proposals mainly concentrate on novel hardware mechanisms that adapt to the dynamic resource requirements of the running threads. This work addresses the L1 cache bandwidth problem in SMT processors experimentally on real hardware. Unlike previous work, this paper concentrates on thread allocation, by selecting the proper pair of co-runners to be launched to the same core. The relation between L1 bandwidth requirements of each benchmark and its performance (IPC) is analyzed. We found that for individual benchmarks, performance is strongly connected to L1 bandwidth consumption, and this observation remains valid when several co-runners are launched to the same SMT core. Based on these findings we propose two L1 bandwidth aware thread to core (t2c) allocation policies, namely Static and Dynamic t2c allocation, respectively. The aim of these policies is to properly balance L1 bandwidth requirements of the running threads among the processor cores. Experiments on a Xeon E5645 processor show that the proposed policies significantly improve the performance of the Linux OS kernel regardless the number of cores considered.This work was supported by the Spanish Ministerio de Econom´ıa y Competitividad (MINECO) and by FEDER funds under Grant TIN2012-38341-C04-01; and by Programa de Apoyo a la Investigacion y Desarrollo (PAID-05-12) of the ´ Universitat Politecnica de Val ` encia under Grant SP20120748Feliu Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2013). L1-Bandwidth Aware Thread Allocation in Multicore SMT Processors. IEEE. https://doi.org/10.1109/PACT.2013.6618810

Crossref

RiuNet

Addressing bandwidth contention in SMT multicores through scheduling

Author: Duato Marín José Francisco
Feliu-Pérez Josué
Petit Martí Salvador Vicente
Sahuquillo Borrás Julio
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/06/2014
Field of study

© Owner/Author 2014. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ICS '14 Proceedings of the 28th ACM international conference on Supercomputing; http://dx.doi.org/10.1145/2597652.2600109To mitigate the impact of bandwidth contention, which in some processes can yield to performance degradations up to 40%, we devise a scheduling algorithm that tackles main memory and L1 bandwidth contention. Experimental evaluation on a real system shows that the proposal achieves an average speedup by 5% with respect to Linux.This work was supported by the Spanish Ministerio de Economía y Competitividad (MINECO) and Plan E funds, under Grant TIN2012-38341-C04-01, and by the Intel Early Career Faculty Honor Program Award.Feliu-Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2014). Addressing bandwidth contention in SMT multicores through scheduling. ACM. https://doi.org/10.1145/2597652.2600109

RiuNet